ABSTRACT



The semi-Markov decision model is considered under
the criterion of long-run average cost. A new
criterion, which for any policy considers the limit
of the expected cost incurred during the first n
transitions divided by the expected length of the
first n transitions, is considered. Conditions
guaranteeing that an optimal stationary (non-randomized)
policy exists are then presented. It
is also shown that the above criterion is equivalent
to the usual one under certain conditions.


AVERAGE  COST  SEMI-MARKOV  DECISION  PROCESSES 


by 

Sheldon  M.  Ross 


1.  INTRODUCTION 

A process is observed at time 0 and classified into some state $x \in X$.
After classification, an action $a \in A$ must be chosen. Both the state space
$X$ and the action space $A$ are assumed to be Borel subsets of complete, separable
metric spaces.

If the state is $x$ and action $a$ is chosen, then

(i) the next state of the process is chosen according to a known regular
conditional probability measure $P(\cdot \mid x,a)$ on the Borel sets of $X$,
and

(ii) conditional on the event that the next state is $y$, the time until
the transition from $x$ to $y$ occurs is a random variable with known
distribution $F(\cdot \mid x,a,y)$.

After the transition occurs, an action is again chosen and (i) and (ii) are
repeated. This is assumed to go on indefinitely.

We further suppose that a cost structure is imposed on the model in the
following manner: If action $a$ is chosen when in state $x$ and the process
makes a transition $t$ units later, then the cost incurred by time $s$ ($s \le t$)
after the action was taken is given by a known real-valued Baire function
$C(s \mid x,a)$.†


†If one allows the cost to also depend upon the next state visited, then
$C(s \mid x,a)$ should be interpreted as an expected cost.
In order to ensure that transitions do not take place too quickly, we shall
need to assume the following:

Condition 1:

There exist $\delta > 0$, $\varepsilon > 0$, such that

$$\int_{y \in X} F(\delta \mid x,a,y)\, dP(y \mid x,a) \le 1 - \varepsilon \quad \text{for all } x, a.$$

In other words, Condition 1 asserts that for every state $x$ and action $a$ there
is a positive probability of at least $\varepsilon$ that the transition time will be greater
than $\delta$.
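As an illustration, Condition 1 can be checked numerically for a single state-action pair. The sketch below assumes a finite set of next states and discrete transition-time distributions; the function name and all numbers are hypothetical and do not come from the paper.

```python
def prob_transition_within(delta, next_state_probs, time_dists):
    """Return P(transition time <= delta | x, a) for one fixed pair (x, a).

    next_state_probs: {y: P(y | x, a)}
    time_dists: {y: [(t, prob), ...]}, a discrete stand-in for F(. | x, a, y)
    """
    total = 0.0
    for y, p_y in next_state_probs.items():
        # F(delta | x, a, y) for a discrete distribution
        f_delta = sum(p for t, p in time_dists[y] if t <= delta)
        total += p_y * f_delta
    return total


# Hypothetical data for a single (x, a) pair.
next_state_probs = {0: 0.5, 1: 0.5}
time_dists = {0: [(0.1, 0.5), (2.0, 0.5)], 1: [(1.0, 1.0)]}

delta = 0.5
p_fast = prob_transition_within(delta, next_state_probs, time_dists)
# Condition 1 holds at this (x, a) for every epsilon up to this value.
largest_epsilon = 1.0 - p_fast
```

For the condition to hold for the whole model, the same `delta` and a common positive `epsilon` must work for every state-action pair.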

A policy $\pi$ is any measurable rule for choosing actions. The problem is to
choose a policy which minimizes the expected average cost per unit time. When the time
between transitions is identically 1, the process is called a Markov
decision process and has been extensively studied (see, for instance, [2], [5] and
[6]). When this restriction is lifted, we have a semi-Markov decision process, and
results have previously been given only for the case where $A$ and $X$ are finite
(see [3] and [4]).
2.  EQUALITY  OF  CRITERIA

Let $X_n$ and $a_n$ be respectively the nth state of the process and the nth
action chosen, $n = 1, 2, \ldots$ Also, let $\tau_n$ be the time between the (n - 1)st
and the nth transition, $n \ge 1$.

Furthermore, let $Z(t)$ denote the total cost incurred by time $t$, and let $Z_n$
be the cost incurred during the nth transition interval;† and define for any
policy $\pi$

$$\phi^1_\pi(x) = \lim_{t \to \infty} E_\pi\left[ \frac{Z(t)}{t} \,\Big|\, X_1 = x \right]$$

and

$$\phi^2_\pi(x) = \lim_{n \to \infty} \frac{E_\pi\left[ \sum_{i=1}^{n} Z_i \mid X_1 = x \right]}{E_\pi\left[ \sum_{i=1}^{n} \tau_i \mid X_1 = x \right]} .$$

Thus $\phi^1$ and $\phi^2$ both represent, in some sense, the average expected cost.
Though $\phi^1$ is clearly more appealing, it will be criterion $\phi^2$ that we shall
deal with. Fortunately, it turns out that under certain conditions both criteria
are identical.


Definition: 

A  policy  is  said  to  be  stationary  if  the  action  it  chooses  only  depends  on 
the  present  state  of  the  system. 

The reader should note at this point that if a stationary policy is employed
then the process $\{X(t), t \ge 0\}$ is a semi-Markov process, where $X(t)$ represents
the state of the process at time $t$.

†Of course, $Z(t)$ and $Z_n$ are determined by $X_i$, $a_i$, $\tau_i$, $i \ge 1$.


For any initial state $x$, let

$$T = \inf\{t > 0 : X(t) = x,\ X(t^-) \ne x\},$$

and

$$N = \min\{n > 0 : X_{n+1} = x\}.$$

Hence, $T$ is the time of the first return to state $x$ and $N$ is the number of
transitions that it takes.†

Lemma 1:

If Condition 1 holds, and if $E_\pi[T \mid X_1 = x] < \infty$, then $E_\pi[N \mid X_1 = x] < \infty$
and $T = \sum_{n=1}^{N} \tau_n$.


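Proof:

By the definition of $T$ and $N$ it follows that $T \ge \sum_{n=1}^{N} \tau_n$, with equality
holding if $N < \infty$. Now, if we let

$$\bar\tau_n = \begin{cases}
0 & \text{if } \tau_n \le \delta, \\
\delta & \text{with probability } \dfrac{\varepsilon}{\int (1 - F(\delta \mid x,a,y))\, dP(y \mid x,a)} \text{ if } \tau_n > \delta,\ X_n = x,\ a_n = a, \\
0 & \text{with probability } 1 - \dfrac{\varepsilon}{\int (1 - F(\delta \mid x,a,y))\, dP(y \mid x,a)} \text{ if } \tau_n > \delta,\ X_n = x,\ a_n = a,
\end{cases}$$

then it follows from Condition 1 that $\bar\tau_n$, $n = 1, 2, \ldots$, are independent and
identically distributed with

$$P\{\bar\tau_n = \delta\} = \varepsilon = 1 - P\{\bar\tau_n = 0\}.$$

Now, from Wald's equation it follows that if $EN = \infty$ then $E \sum_{n=1}^{N} \bar\tau_n = \infty$,
and hence that

$$ET \ge E \sum_{n=1}^{N} \tau_n \ge E \sum_{n=1}^{N} \bar\tau_n = \infty \quad (\text{since } \bar\tau_n \le \tau_n),$$

which contradicts the assumption that $ET < \infty$.

Q.E.D.

†If the set in brackets is empty then take $N$ to be $\infty$, and similarly for $T$.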
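The Wald step can also be read as an explicit bound. The following is only a restatement under the proof's own assumptions, writing $\bar\tau_n$ for the truncated transition time that equals $\delta$ with probability $\varepsilon$ and 0 otherwise:

```latex
\begin{align*}
E\Big[\sum_{n=1}^{N} \bar\tau_n\Big] &= E[N]\, E[\bar\tau_1] = \varepsilon\,\delta\, E[N], \\
\varepsilon\,\delta\, E[N] &\le E\Big[\sum_{n=1}^{N} \tau_n\Big] \le E[T]
\quad\Longrightarrow\quad
E[N] \le \frac{E[T]}{\varepsilon\,\delta}.
\end{align*}
```

So a finite expected return time forces a finite expected number of transitions per cycle, with the constants of Condition 1 controlling the ratio.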


Theorem 1:

Assume Condition 1. If $\pi$ is a stationary policy, and if $E_\pi[T \mid X_1 = x] < \infty$,
then

$$\phi^1_\pi(x) = \phi^2_\pi(x) = \frac{E_\pi[Z(T) \mid X_1 = x]}{E_\pi[T \mid X_1 = x]} .$$


Proof:

Suppose throughout the proof that $X_1 = x$. Now, under a stationary policy
$\{X(t), t \ge 0\}$ is a regenerative process with regeneration (or cycle) point $T$.
Hence, by a well known result

$$\phi^1_\pi(x) = E_\pi[\text{cost incurred during a cycle}] / E_\pi[\text{length of cycle}] = E_\pi[Z(T)] / E_\pi[T].$$

Also, it is easy to see that $\{X_n, n = 1, 2, \ldots\}$ is a discrete time regenerative
process with regeneration time $N$. Hence, by regarding $Z_1 + \ldots + Z_N$ as the
"cost" incurred during the first cycle of this process, it follows by the same
well known result that

(1)   $E_\pi\left[ \sum_{n=1}^{m} Z_n / m \right] \to E_\pi\left[ \sum_{n=1}^{N} Z_n \right] / E_\pi N$ as $m \to \infty$,

where we have used Lemma 1 to assert that $E_\pi N < \infty$. However, we may also regard
$\tau_1 + \ldots + \tau_N$ as the "cost" incurred during the first cycle and hence, by the
same reasoning,

(2)   $E_\pi\left[ \sum_{n=1}^{m} \tau_n / m \right] \to E_\pi\left[ \sum_{n=1}^{N} \tau_n \right] / E_\pi N$ as $m \to \infty$.

By combining (1) and (2) we obtain

$$\phi^2_\pi(x) = \frac{E_\pi \sum_{n=1}^{N} Z_n}{E_\pi \sum_{n=1}^{N} \tau_n} .$$

However, since $N < \infty$ (Lemma 1) it is easy to see that $\sum_{n=1}^{N} Z_n = Z(T)$ and
$\sum_{n=1}^{N} \tau_n = T$, and the result follows.

Q.E.D.
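The ratio formula can be checked numerically on a small finite chain. The sketch below assumes a hypothetical three-state embedded chain under one fixed stationary policy, with deterministic transition times; it compares the long-run ratio of expected cost to expected time with the per-cycle ratio. All names and numbers are illustrative.

```python
def stationary_distribution(P, iters=5000):
    """Stationary distribution of the embedded chain by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def expected_cycle_total(P, reward, x, iters=5000):
    """Expected sum of `reward` over the transitions of one cycle from x back to x."""
    n = len(P)
    h = [0.0] * n  # h[y]: expected reward collected from y until x is next entered
    for _ in range(iters):
        h = [reward[y] + sum(P[y][z] * h[z] for z in range(n) if z != x)
             for y in range(n)]
    return h[x]


# Hypothetical embedded chain, costs and transition times under one stationary policy.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]
cost = [1.0, 4.0, 2.0]  # expected cost per transition interval in each state
tau = [1.0, 0.5, 2.0]   # expected length of the transition interval in each state

pi = stationary_distribution(P)
ratio_long_run = (sum(p * c for p, c in zip(pi, cost))
                  / sum(p * t for p, t in zip(pi, tau)))
ratio_cycle = expected_cycle_total(P, cost, 0) / expected_cycle_total(P, tau, 0)
```

Both ratios agree up to numerical error, which is the finite-state counterpart of the identity in Theorem 1.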


Remarks:

It also follows from the above proof that, with probability 1,

$$\lim_{t \to \infty} \frac{Z(t)}{t} = \lim_{m \to \infty} \frac{\sum_{n=1}^{m} Z_n}{\sum_{n=1}^{m} \tau_n} = \frac{E_\pi[Z(T)]}{E_\pi T} .$$

Also, suppose that the initial state is $y$, $y \ne x$. When is it true that
$\phi^1_\pi(y) = \phi^2_\pi(y) = \phi^1_\pi(x)$? One answer is that if, with probability 1, the process
will eventually enter state $x$, then $\{X(t), t \ge 0\}$ is a delayed (or general)
regenerative process, and the proof goes through in an identical manner.

Let

$$\tau(x,a) = \int_{y \in X} \int_0^\infty t\, dF(t \mid x,a,y)\, dP(y \mid x,a)$$

and

$$C(x,a) = \int_{y \in X} \int_0^\infty C(t \mid x,a)\, dF(t \mid x,a,y)\, dP(y \mid x,a).$$

We shall suppose that both $C(x,a)$ and $\tau(x,a)$ exist and are finite for all
$x$, $a$.

We also note that the expected cost incurred during a transition interval and
the expected length of a transition interval only depend on the parameters of the
process through $\tau(x,a)$, $C(x,a)$ and $P(\cdot \mid x,a)$; and, as a result, $\phi^2$ only
depends on the parameters of the process through these three functions. Thus, we
may choose the cost and transition time distributions in as convenient a manner as
possible; and hence for the remainder of this paper, let us suppose that

$$F(t \mid x,a,y) = \begin{cases} 1 & t \ge \tau(x,a) \\ 0 & t < \tau(x,a) \end{cases}$$

and

$$C(t \mid x,a) = \begin{cases} 0 & t < \tau(x,a) \\ C(x,a) & t \ge \tau(x,a). \end{cases}$$

That is, we suppose that the time until transition is (with probability 1)
$\tau(x,a)$ and that a cost of $C(x,a)$ is incurred at the time of transition.
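For a model with finitely many next states and discrete transition-time distributions, the two double integrals reduce to weighted sums. The following sketch computes both for one state-action pair; the data and the linear cost function are hypothetical choices made only for illustration.

```python
def mean_time_and_cost(next_state_probs, time_dists, cost_by_time):
    """Return (tau(x, a), C(x, a)) for one fixed pair (x, a).

    next_state_probs: {y: P(y | x, a)}
    time_dists: {y: [(t, prob), ...]}, a discrete stand-in for F(. | x, a, y)
    cost_by_time: function t -> C(t | x, a), the cost incurred if the transition occurs at t
    """
    tau = 0.0
    cost = 0.0
    for y, p_y in next_state_probs.items():
        for t, p_t in time_dists[y]:
            tau += p_y * p_t * t
            cost += p_y * p_t * cost_by_time(t)
    return tau, cost


# Hypothetical data: cost accrues at rate 3 per unit time until the transition.
next_state_probs = {0: 0.5, 1: 0.5}
time_dists = {0: [(0.1, 0.5), (2.0, 0.5)], 1: [(1.0, 1.0)]}
tau_xa, cost_xa = mean_time_and_cost(next_state_probs, time_dists, lambda t: 3.0 * t)
```

Once these two numbers are known for every pair, the reduction described in the text replaces the original distributions by a deterministic transition time `tau_xa` and a lump cost `cost_xa`.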
3.  AVERAGE  COST  RESULTS


Theorem 2:

Assuming Condition 1, if there exists a bounded Baire function $f(x)$, $x \in X$,
and a constant $g$, such that

(3)   $f(x) = \min_a \left[ C(x,a) + \int_{y \in X} f(y)\, dP(y \mid x,a) - g\,\tau(x,a) \right], \quad x \in X,$

then, for any policy $\pi^*$ which, when in state $x$, selects an action minimizing
the right side of (3), we have

$$\phi^2_{\pi^*}(x) = g = \min_\pi \phi^2_\pi(x) \quad \text{for all } x \in X.$$


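Proof:

Let $S_i = (X_1, a_1, \ldots, X_i, a_i)$, $i = 1, 2, \ldots$ For any policy $\pi$,

$$E_\pi\left[ \sum_{i=2}^{n} \left[ f(X_i) - E_\pi(f(X_i) \mid S_{i-1}) \right] \right] = 0.$$

But,

$$\begin{aligned}
E_\pi[f(X_i) \mid S_{i-1}] &= \int f(y)\, dP(y \mid X_{i-1}, a_{i-1}) \\
&= C(X_{i-1}, a_{i-1}) + \int f(y)\, dP(y \mid X_{i-1}, a_{i-1}) - g\,\tau(X_{i-1}, a_{i-1}) \\
&\qquad - C(X_{i-1}, a_{i-1}) + g\,\tau(X_{i-1}, a_{i-1}) \\
&\ge \min_a \left\{ C(X_{i-1}, a) + \int f(y)\, dP(y \mid X_{i-1}, a) - g\,\tau(X_{i-1}, a) \right\} \\
&\qquad - C(X_{i-1}, a_{i-1}) + g\,\tau(X_{i-1}, a_{i-1}) \\
&= f(X_{i-1}) - C(X_{i-1}, a_{i-1}) + g\,\tau(X_{i-1}, a_{i-1}),
\end{aligned}$$

with equality for $\pi^*$, since $\pi^*$ takes the minimizing actions. Hence,

$$0 \le E_\pi \sum_{i=2}^{n} \left[ f(X_i) - f(X_{i-1}) + C(X_{i-1}, a_{i-1}) - g\,\tau(X_{i-1}, a_{i-1}) \right]$$

or

$$g \le \frac{E_\pi[f(X_n) - f(X_1)]}{E_\pi \sum_{i=2}^{n} \tau(X_{i-1}, a_{i-1})} + \frac{E_\pi \sum_{i=2}^{n} C(X_{i-1}, a_{i-1})}{E_\pi \sum_{i=2}^{n} \tau(X_{i-1}, a_{i-1})}$$

with equality for $\pi^*$. By letting $n \to \infty$ and using the boundedness of $f$ and
the fact that Condition 1 implies that

$$E_\pi \sum_{i=2}^{n} \tau(X_{i-1}, a_{i-1}) \ge (n - 1)\,\varepsilon\,\delta \to \infty,$$

we obtain

$$g \le \lim_{n \to \infty} \frac{E_\pi \sum_{i=2}^{n} C(X_{i-1}, a_{i-1})}{E_\pi \sum_{i=2}^{n} \tau(X_{i-1}, a_{i-1})} = \phi^2_\pi(X_1)$$

with equality for $\pi^*$ and for all possible values of $X_1$.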
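A small finite model makes the statement of Theorem 2 concrete. The sketch below uses a hypothetical two-state, two-action model: it finds the best stationary policy by enumeration, computes its average cost $g$ and a function $f$ normalized so that $f(0) = 0$, and then checks that the pair satisfies equation (3). Enumeration and the fixed-policy iteration are choices made for brevity here, not methods taken from the paper.

```python
from itertools import product

# Hypothetical two-state, two-action model (all numbers are illustrative).
P = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
     (1, 0): [0.3, 0.7], (1, 1): [0.8, 0.2]}
C = {(0, 0): 2.0, (0, 1): 3.0, (1, 0): 5.0, (1, 1): 6.0}
TAU = {(0, 0): 1.0, (0, 1): 1.5, (1, 0): 2.0, (1, 1): 1.0}
STATES, ACTIONS = (0, 1), (0, 1)


def stationary(policy, iters=5000):
    """Stationary distribution of the embedded chain under a stationary policy."""
    pi = [0.5, 0.5]
    for _ in range(iters):
        pi = [sum(pi[x] * P[x, policy[x]][y] for x in STATES) for y in STATES]
    return pi


def gain(policy):
    """Expected cost per transition divided by expected time per transition."""
    pi = stationary(policy)
    return (sum(pi[x] * C[x, policy[x]] for x in STATES)
            / sum(pi[x] * TAU[x, policy[x]] for x in STATES))


def relative_values(policy, g, iters=5000):
    """Solve f = C - g*tau + P f for the fixed policy, normalized so that f[0] = 0."""
    f = [0.0, 0.0]
    for _ in range(iters):
        new = [C[x, policy[x]] - g * TAU[x, policy[x]]
               + sum(P[x, policy[x]][y] * f[y] for y in STATES) for x in STATES]
        f = [v - new[0] for v in new]
    return f


def rhs_of_3(x, f, g):
    """Right side of equation (3) at state x."""
    return min(C[x, a] - g * TAU[x, a] + sum(P[x, a][y] * f[y] for y in STATES)
               for a in ACTIONS)


best = min(product(ACTIONS, repeat=2), key=gain)
g = gain(best)
f = relative_values(best, g)
residuals = [rhs_of_3(x, f, g) - f[x] for x in STATES]
```

The residuals vanish up to numerical error, so `(f, g)` solves (3) for this model and `best` picks a minimizing action in each state.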


Remarks:

The above proof is an adaptation of one given in [6] for Markov decision
processes. We have tacitly assumed that a rule minimizing the right side of (3)
may be chosen in a measurable manner. Clearly a sufficient (but by no means
necessary) condition is that the action space $A$ be countable.

In order to determine sufficient conditions for the existence of a bounded
function $f(x)$ and a constant $g$ satisfying (3), we introduce a discount factor
$\alpha$, $0 < \alpha < \infty$, and continuously discount costs. That is, we suppose that
a cost of $C$ incurred at time $t$ is equivalent to a cost $Ce^{-\alpha t}$ incurred at
time 0.

Let $V_{\pi,\alpha}(x)$ denote the total expected discounted cost when $\pi$ is employed,
and the initial state is $x$; and let $V_\alpha(x) = \inf_\pi V_{\pi,\alpha}(x)$. Then it may be
shown by standard arguments (see [1]) that

(4)   $V_\alpha(x) = \min_a e^{-\alpha\tau(x,a)} \left[ C(x,a) + \int V_\alpha(y)\, dP(y \mid x,a) \right].$

Now, fix some state, call it 0, and define

$$f_\alpha(x) = V_\alpha(x) - V_\alpha(0).$$

From (4), we obtain

(5)   $V_\alpha(0) + f_\alpha(x) = \min_a e^{-\alpha\tau(x,a)} \left[ C(x,a) + V_\alpha(0) + \int f_\alpha(y)\, dP(y \mid x,a) \right].$

We shall need the following condition:

Condition 2:

There exists an $M < \infty$, such that

$$C(x,a) \le M\tau(x,a) \quad \text{for all } x, a.$$
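The discounted optimality equation can be solved by successive approximation when the model is finite. The sketch below assumes a hypothetical two-state, two-action model with deterministic transition times, so each stage is discounted by $e^{-\alpha\tau(x,a)}$; the iteration scheme and all numbers are illustrative assumptions rather than part of the paper.

```python
import math

# Hypothetical two-state, two-action model (all numbers are illustrative).
P = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
     (1, 0): [0.3, 0.7], (1, 1): [0.8, 0.2]}
C = {(0, 0): 2.0, (0, 1): 3.0, (1, 0): 5.0, (1, 1): 6.0}
TAU = {(0, 0): 1.0, (0, 1): 1.5, (1, 0): 2.0, (1, 1): 1.0}
STATES, ACTIONS = (0, 1), (0, 1)


def bellman_step(V, alpha):
    """Apply the right side of the discounted optimality equation once."""
    return [min(math.exp(-alpha * TAU[x, a])
                * (C[x, a] + sum(P[x, a][y] * V[y] for y in STATES))
                for a in ACTIONS)
            for x in STATES]


def discounted_values(alpha, iters=2000):
    """Successive approximation starting from V = 0."""
    V = [0.0 for _ in STATES]
    for _ in range(iters):
        V = bellman_step(V, alpha)
    return V


alpha = 0.1
V = discounted_values(alpha)
# How far V is from being a fixed point of the optimality equation.
residual = max(abs(v - w) for v, w in zip(V, bellman_step(V, alpha)))
# alpha * V_alpha(0), the scaled value at the reference state 0.
scaled = alpha * V[0]
```

Repeating the computation for smaller values of `alpha` (with proportionally more iterations, since the contraction weakens) shows how `scaled` behaves as the discount rate decreases.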
Theorem 3:

Under Conditions 1 and 2, if the action space $A$ is finite, and if
$\{f_\alpha(x), 0 < \alpha < c\}$ is a uniformly bounded equicontinuous family of functions for
some $0 < c < \infty$, then

(i) there exists a bounded continuous function $f(x)$ and a constant $g$
satisfying (3);

(ii) for some sequence $\alpha_n \to 0$, $f(x) = \lim_{n \to \infty} f_{\alpha_n}(x)$;

(iii) $\lim_{\alpha \to 0} \alpha V_\alpha(x) = g$ for all $x \in X$.

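Proof:

From (5), we obtain that

(6)   $f_\alpha(x) = \min_a \left\{ e^{-\alpha\tau(x,a)} \left[ C(x,a) + \int f_\alpha(y)\, dP(y \mid x,a) \right] - \alpha V_\alpha(0)\, \frac{1 - e^{-\alpha\tau(x,a)}}{\alpha} \right\}.$

Now, by the Arzela-Ascoli theorem there exists a sequence $\alpha_n \to 0$ and a
continuous function $f$ such that $\lim_{n \to \infty} f_{\alpha_n}(x) = f(x)$ for all $x$. Also, it
follows from Conditions 1 and 2 that $\alpha V_\alpha(0)$ is bounded, and hence we can require
that $\lim_{n \to \infty} \alpha_n V_{\alpha_n}(0) \equiv g$ exists. The results (i) and (ii) then follow by letting
$\alpha_n \to 0$ in (6) and using Lebesgue's dominated convergence theorem.

The proof of (iii) is identical with the one given in [6].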
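The passage to the limit can be written out term by term. The display below is a sketch: it starts from the discounted optimality equation for $f_\alpha$ with $V_\alpha(0)$ subtracted from both sides, and relies on the finiteness of $A$ to interchange limit and minimum and on dominated convergence to interchange limit and integral.

```latex
\begin{align*}
f_{\alpha}(x) &= \min_a \Big\{ e^{-\alpha\tau(x,a)} \Big[ C(x,a) + \int f_{\alpha}(y)\,dP(y \mid x,a) \Big]
  - \alpha V_{\alpha}(0)\, \frac{1 - e^{-\alpha\tau(x,a)}}{\alpha} \Big\}, \\
e^{-\alpha_n \tau(x,a)} &\to 1, \qquad
\frac{1 - e^{-\alpha_n \tau(x,a)}}{\alpha_n} \to \tau(x,a), \qquad
\alpha_n V_{\alpha_n}(0) \to g, \qquad
f_{\alpha_n} \to f, \\
f(x) &= \min_a \Big\{ C(x,a) + \int f(y)\,dP(y \mid x,a) - g\,\tau(x,a) \Big\}.
\end{align*}
```

The last line is equation (3), which is how the discounted problem produces the pair $(f, g)$ needed in Theorem 2.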
4.  AN  EXAMPLE

Suppose that batches of letters arrive at a post office at a Poisson rate $\lambda$.
Suppose further that each batch consists of $j$ letters with probability $P_j$,
$j \ge 1$, independently of each other. At any time, a truck may be dispatched to
deliver the letters. Assume that the cost of dispatching the truck is $K$, and
also that the cost rate when there are $j$ letters present is $C(j)$, an increasing,
positive, bounded sequence, $j \ge 0$. The problem is to choose a policy minimizing
the long-run average cost.

The above may be regarded as a two-action semi-Markov decision process with
states 1, 2, 3, ...; where state $i$ means that there are $i$ letters presently
in the post office. Action 1 is "dispatch a truck" and action 2 is "don't
dispatch a truck." (Note that since a truck would never be dispatched if there
were no letters in the post office, we need not have a state 0.)

The parameters of the process are:

$P(j \mid i,1) = P_j$,   $P(i + j \mid i,2) = P_j$,

$\tau(i,1) = 1/\lambda$,   $\tau(i,2) = 1/\lambda$,

$C(i,1) = K + C(0)/\lambda$,   $C(i,2) = C(i)/\lambda$.


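Now, if we let

$$V_\alpha(i,0) = 0$$

and for $n \ge 1$

$$e^{\alpha/\lambda} V_\alpha(i,n) = \min\left\{ K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(j, n-1),\ \frac{C(i)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(i + j, n-1) \right\},$$

then it follows by induction that $V_\alpha(i,n)$ is increasing in $i$ for each $n$.
Also, since costs are bounded and the discount factor $e^{-\alpha/\lambda} < 1$, it follows that
$V_\alpha(i) = \lim_n V_\alpha(i,n)$, and hence $V_\alpha(i)$ is increasing. Also, $V_\alpha(i)$ satisfies

(7)   $e^{\alpha/\lambda} V_\alpha(i) = \min\left\{ K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(j),\ \frac{C(i)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(i + j) \right\}.$

We will now show that $V_\alpha(i) - V_\alpha(1)$ is uniformly bounded and hence Theorem 3 is
applicable. To do this, we consider two cases: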


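Case i:

$$e^{\alpha/\lambda} V_\alpha(1) = K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(j).$$

In this case, we have by (7) that $V_\alpha(i) \le V_\alpha(1)$ and hence, by monotonicity,

$$V_\alpha(i) - V_\alpha(1) = 0 \quad \text{for all } i.$$

Case ii:

$$e^{\alpha/\lambda} V_\alpha(1) = \frac{C(1)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(1 + j).$$

In this case, we have by (7) that

$$\begin{aligned}
e^{\alpha/\lambda} V_\alpha(1) \le e^{\alpha/\lambda} V_\alpha(i) &\le K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(j) \\
&\le K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j V_\alpha(j + 1) \\
&= K + \frac{C(0) - C(1)}{\lambda} + e^{\alpha/\lambda} V_\alpha(1).
\end{aligned}$$

Thus, in either case $V_\alpha(i) - V_\alpha(1)$ is uniformly bounded and hence by Theorem 3
there exists an increasing function $h(i)$ and a constant $g$ such that

$$h(i) = \min\left\{ K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j h(j) - \frac{g}{\lambda},\ \frac{C(i)}{\lambda} + \sum_{j=1}^{\infty} P_j h(j + i) - \frac{g}{\lambda} \right\}$$

and the policy which chooses the minimizing actions is optimal.

Now, if we let

$$i^* = \min\left\{ i : \frac{C(i)}{\lambda} + \sum_{j=1}^{\infty} P_j h(j + i) > K + \frac{C(0)}{\lambda} + \sum_{j=1}^{\infty} P_j h(j) \right\},$$

then it follows from the monotonicity of $C(i)$ and $h(i)$ that the optimal policy
is to dispatch a truck whenever the number of letters in the post office is at
least $i^*$; and hence, the structure of the optimal policy is determined.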
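The threshold structure can be observed numerically. The sketch below is a rough illustration under several assumptions that are not in the paper: the state space is truncated at `N_MAX` (larger counts are lumped into `N_MAX`), a small discount rate `ALPHA` stands in for the average-cost limit, dispatching is charged `K` plus the holding cost of an empty office until the next batch, and the arrival rate, batch distribution and cost rate are all made-up values.

```python
import math

LAM = 1.0                          # arrival rate of batches (assumed value)
K = 10.0                           # dispatch cost (assumed value)
BATCH = {1: 0.5, 2: 0.3, 3: 0.2}   # batch-size distribution P_j (assumed)
N_MAX = 60                         # truncation level for the number of letters
ALPHA = 0.05                       # small discount rate (stand-in for the limit alpha -> 0)


def cost_rate(i):
    """Increasing, bounded holding-cost rate; the value at i = 0 is an assumption."""
    return 5.0 * (1.0 - 0.9 ** i)


def action_values(V, i):
    """Bracketed terms of the optimality equation at state i: (dispatch, wait)."""
    dispatch = K + cost_rate(0) / LAM + sum(p * V[j] for j, p in BATCH.items())
    wait = cost_rate(i) / LAM + sum(p * V[min(i + j, N_MAX)] for j, p in BATCH.items())
    return dispatch, wait


def solve(iters=1000):
    """Successive approximation for the discounted values, starting from zero."""
    V = [0.0] * (N_MAX + 1)  # V[0] is unused padding so that V[i] matches state i
    disc = math.exp(-ALPHA / LAM)
    for _ in range(iters):
        V = [0.0] + [disc * min(action_values(V, i)) for i in range(1, N_MAX + 1)]
    return V


V = solve()
dispatch_in_state = []
for i in range(1, N_MAX + 1):
    d, w = action_values(V, i)
    dispatch_in_state.append(w > d)  # dispatch when waiting is strictly more costly

# Smallest state in which dispatching is preferred, or None if there is none.
threshold = next((i for i, flag in enumerate(dispatch_in_state, start=1) if flag), None)
```

Because the value of waiting increases with the number of letters while the value of dispatching does not depend on it, the list `dispatch_in_state` switches from `False` to `True` at most once, which is the control-limit structure described in the text.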